The fundamental problem of multiple secondary users contending for opportunistic spectrum access over multiple channels in cognitive radio networks has recently been formulated as a decentralized multi-armed bandit (D-MAB) problem. In a D-MAB problem there are $M$ users and $N$ arms (channels), each of which offers i.i.d. stochastic rewards with unknown means so long as it is accessed without collision. The goal is to design a decentralized online learning policy that incurs minimal regret, defined as the difference between the total expected reward accumulated by a model-aware genie and that obtained by all users applying the policy. We make two contributions in this paper. First, we consider the setting where the users have a prioritized ranking, such that it is desired for the $K$-th-ranked user to learn to access the arm offering the $K$-th highest mean reward. For this problem, we present the first distributed policy that yields regret that is uniformly logarithmic over time without requiring any prior assumption about the mean rewards. Second, we consider the case when a fair access policy is required, i.e., it is desired for all users to experience the same mean reward. For this problem, we present a distributed policy that yields order-optimal regret scaling with respect to the number of users and arms, better than previously proposed policies in the literature. Both of our distributed policies make use of an innovative modification of the well-known UCB1 policy for the classic multi-armed bandit problem that allows a single user to learn how to play the arm that yields the $K$-th largest mean reward.
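To illustrate the core idea of learning the $K$-th best arm, the following is a minimal single-user sketch in which a standard UCB1 index is computed for every arm and the arm whose index ranks $K$-th largest is played. This is only an assumption-laden illustration of the concept described in the abstract, not the paper's actual policy; the Bernoulli reward model, the example means, and the function name `kth_best_ucb` are all hypothetical.

```python
import math
import random


def kth_best_ucb(means, K, horizon, seed=0):
    """Single-user sketch: repeatedly play the arm whose UCB1 index
    is the K-th largest.

    A simplified illustration only (NOT the paper's exact policy);
    rewards are Bernoulli with the given (unknown to the learner) means.
    """
    rng = random.Random(seed)
    N = len(means)
    counts = [0] * N      # number of plays per arm
    sums = [0.0] * N      # cumulative reward per arm

    # Initialization: play each arm once.
    for i in range(N):
        counts[i] = 1
        sums[i] = 1.0 if rng.random() < means[i] else 0.0

    for t in range(N, horizon):
        # Standard UCB1 index: empirical mean plus confidence padding.
        idx = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(N)]
        # Modification: target the K-th largest index instead of the largest.
        arm = sorted(range(N), key=lambda i: idx[i], reverse=True)[K - 1]
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0

    # The arm played most often is the learner's estimate of the K-th best.
    return max(range(N), key=lambda i: counts[i])
```

For example, with means `[0.9, 0.7, 0.5, 0.3]` and `K = 2`, a sufficiently long horizon concentrates almost all plays on the arm with the second-highest mean (index 1): arms with inflated indices above or below the target rank are sampled only logarithmically often before their confidence padding shrinks.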